Adaptive Data Replication Scheme Based on Access Count Prediction in Hadoop
Abstract
Hadoop, an open-source implementation of the MapReduce framework, has been widely used for processing massive-scale data in parallel. Since Hadoop uses a distributed file system called HDFS, the data locality problem often arises (i.e., a data block must be copied to a processing node that does not hold the block in its local storage), which degrades performance. In this paper, we present an Adaptive Data Replication scheme based on Access count Prediction (ADRAP) for the Hadoop framework to address the data locality problem. The proposed scheme predicts the next access count of a data file by applying Lagrange's interpolation to its previous access counts. Based on the predicted access count, the scheme selectively decides whether to generate a new replica or to use the loaded data as a cache, optimizing the replication factor. Furthermore, we provide a replica placement algorithm that improves data locality effectively. Performance evaluations show that our adaptive data replication scheme reduces task completion time in the map phase by 9.6% on average, compared to the default data replication setting in Hadoop. With regard to data locality, our scheme increases node locality by 6.1% and decreases rack and rack-off locality by 45.6% and 56.5%, respectively.

Keywords: Hadoop, Data locality, Access prediction, Data replication, Data placement
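The abstract names Lagrange's interpolation over previous access counts as the prediction mechanism but gives no implementation detail, so the following Python sketch is illustrative only; the per-epoch indexing, the threshold rule in decide(), and all identifiers are assumptions rather than the authors' code.

from typing import Sequence

def predict_next_access_count(counts: Sequence[float]) -> float:
    """Predict the next access count with Lagrange interpolation.

    counts[i] is the observed access count in epoch i (x = 0..n-1);
    the prediction is the interpolating polynomial evaluated at x = n,
    i.e., the upcoming epoch. The epoch indexing is an assumption.
    """
    n = len(counts)
    prediction = 0.0
    for i, c_i in enumerate(counts):
        # Lagrange basis polynomial L_i(x) evaluated at x = n.
        basis = 1.0
        for j in range(n):
            if j != i:
                basis *= (n - j) / (i - j)
        prediction += c_i * basis
    return max(prediction, 0.0)  # an access count cannot be negative

def decide(counts: Sequence[float], replica_threshold: float) -> str:
    """Hypothetical decision rule: replicate hot files, cache the rest.

    The abstract only says the choice is made "with the predicted data
    access count"; this threshold comparison is an assumed stand-in.
    """
    if predict_next_access_count(counts) >= replica_threshold:
        return "create-replica"
    return "use-as-cache"

# Example: a file accessed 2, 5, 11 times in the last three epochs.
print(decide([2, 5, 11], replica_threshold=10.0))  # -> "create-replica"

For the example history [2, 5, 11], the interpolating polynomial extrapolates to 20 accesses in the next epoch, so an assumed threshold of 10 would trigger creation of a new replica.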
Similar Articles
Efficient Data Replication Scheme based on Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is designed to store huge data sets reliably and has been widely used for processing massive-scale data in parallel. In HDFS, the data locality problem is one of the critical problems that degrade file system performance. To solve the data locality problem, we propose an efficient data replication scheme based on access count prediction in a Hadoop...
An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management
Hadoop is an open-source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a unique type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle massive-scale data, Hadoop exploits the Hadoop Distributed File System, termed HDFS. HDFS, similar to most distributed file systems, sh...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop implementation and the existing rack-aware data placement strategy of the Hadoop distributed file system assume a homogeneous cluster, in which every node has the same computing capacity and is assigned the same workload. Default Hadoop d...
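This abstract is truncated before it reaches the placement algorithm itself, so the sketch below only illustrates the general idea it motivates: assigning blocks in proportion to each node's computing capacity. The function name, the capacity map, and the proportional rule are assumed for illustration and are not the paper's algorithm.

def place_blocks(num_blocks: int, capacities: dict[str, float]) -> dict[str, int]:
    """Assign data blocks to nodes in proportion to computing capacity.

    capacities maps node name -> relative capacity (e.g., a measured
    processing rate); a node twice as fast receives roughly twice as
    many blocks, so fast nodes are not starved of local data.
    """
    total = sum(capacities.values())
    assignment = {node: int(num_blocks * cap / total)
                  for node, cap in capacities.items()}
    # Hand out the blocks lost to rounding, fastest nodes first.
    leftover = num_blocks - sum(assignment.values())
    for node in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        assignment[node] += 1
    return assignment

# Example: 10 blocks over one fast node and two slower ones.
print(place_blocks(10, {"node-a": 2.0, "node-b": 1.0, "node-c": 1.0}))
# -> {'node-a': 6, 'node-b': 2, 'node-c': 2}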
Dynamic Replication based on Firefly Algorithm in Data Grid
In a data grid, reservation is used to provide scheduling and quality of service. Users need access to data stored in a geographically distributed environment; replication addresses this need by directing each user toward the nearest replica of the requested data. The most important point is to know at which sites and in which distributed sy...
Delay Scheduling Based Replication Scheme for Hadoop Distributed File System
The data generated and processed by modern computing systems are burgeoning rapidly. MapReduce is an important programming model for large-scale data-intensive applications. Hadoop is a popular open-source implementation of MapReduce and the Google File System (GFS). The scalability and fault-tolerance features of Hadoop make it a standard for Big Data processing. Hadoop uses the Hadoop Distributed File Sys...